Abstract

Adaptive variable-length codes associate a variable-length codeword to the symbol
being encoded depending on the previous symbols in the input string. This class of
codes has been recently presented in [Dragoş Trincă, arXiv:cs.DS/0505007] as a new
class of non-standard variable-length codes. New algorithms for data compression,
based on adaptive variable-length codes of order one and Huffman's algorithm, have
been recently presented in [Dragoş Trincă, ITCC 2004]. In this paper, we extend
this work with the following contributions: first, we propose an improved
generalization of these algorithms, called EAHn. Second, we compute the entropy
bounds for EAHn, using the well-known bounds for Huffman's algorithm. Third,
we discuss implementation details and report experimental results obtained
on some well-known corpora. Finally, we describe a parallel version of EAHn using
the PRAM model of computation.

1 Introduction



With the continuous growth of the Internet, the need for rapid network
communications, and the completion of the Human Genome, data compression
remains an important and attractive research area.

One of the earliest and most studied data compression techniques is Huffman's
classical algorithm [6]. Even though many algorithms developed in the last
decades significantly outperform Huffman's algorithm in terms of compression
performance, classical Huffman coding is still the subject of many studies. We
give two illustrative examples.
First, it is well known that Huffman coding has been, and still is, successfully
used as an intermediate entropy-coding step in the Burrows-Wheeler
compression algorithm introduced in 1994 [4]. Since its publication, this
algorithm has been improved considerably, and is currently available on multiple
platforms [13] as a general-purpose compression algorithm. Second, it seems that
Huffman coding is currently used in industry [5] to develop compression algorithms
and protocols for rapid satellite communications.

From a practical perspective, using Huffman coding as an intermediate step 
in other compression schemes seems a bit surprising, since it performs poorly 
in practice compared with other algorithms. Some of the reasons behind this 
choice are the following: 

(1) Huffman's algorithm has a good running time compared with other com- 
pression algorithms. Therefore, when the runtime is a critical parameter 
in the system, this seems a reasonable choice. 

(2) Usually, when used as an intermediate entropy-coding step, it gives good 
results compared with other compression techniques. 

In this paper, we present a new coding technique called EAHn (i.e., Encoder
based on Adaptive variable-length codes of order n and Huffman's algorithm),
which is a generalization of the algorithms recently presented in [18]. EAHn is
not intended to be used as a single encoder. Instead, it is intended to replace
Huffman coding as an intermediate step in other compression schemes.

The paper is organized as follows. First, we recall some basic definitions and
notations related to adaptive variable-length codes (section 2). In section 3, we
present in detail the new algorithm EAHn, discussing both its offline and online
versions. After describing the algorithm, we compute its entropy
bounds using the well-known entropy bounds for Huffman's algorithm (section 4).
Implementation details and experimental results obtained
on some well-known corpora are provided in sections 5 and 6. As we shall
see, using EAHn instead of Huffman's algorithm, either as an intermediate
step in other compression schemes or as a single encoder, is preferable, since
it gives significantly better results. In section 7, we describe a parallel version
of EAHn using the PRAM model of computation. Finally, in the last section,
we discuss some directions for future work.



2 Adaptive variable-length codes 

Adaptive variable-length codes have been recently presented in [18] as a new 
class of non-standard variable-length codes. The aim of this section is to briefly 
review some basic definitions and notations. For more details, the reader is
referred to [17,18].



We denote by $|S|$ the cardinality of the set $S$; if $x$ is a string of finite length,
then $|x|$ denotes the length of $x$. The empty string is denoted by $\lambda$. For an
alphabet $\Sigma$, we denote by $\Sigma^n$ the set $\{s_1 s_2 \cdots s_n \mid s_i \in \Sigma \text{ for all } i\}$,
by $\Sigma^*$ the set $\bigcup_{n=0}^{\infty} \Sigma^n$, and by $\Sigma^+$ the set $\bigcup_{n=1}^{\infty} \Sigma^n$,
where $\Sigma^0$ denotes the set $\{\lambda\}$. Also, we denote by $\Sigma^{\leq n}$ the set
$\bigcup_{i=0}^{n} \Sigma^i$ and by $\Sigma^{\geq n}$ the set $\bigcup_{i=n}^{\infty} \Sigma^i$.

Let $X$ be a finite and nonempty subset of $\Sigma^+$, and $w \in \Sigma^+$. A decomposition
of $w$ over $X$ is any sequence of strings $u_1, u_2, \ldots, u_h$ with $u_i \in X$ for all $i$, such
that $w = u_1 u_2 \cdots u_h$. A code over $\Sigma$ is any nonempty set $C \subseteq \Sigma^+$ such that
each string $w \in \Sigma^+$ has at most one decomposition over $C$. A prefix code over
$\Sigma$ is any code $C$ over $\Sigma$ such that no string in $C$ is a proper prefix of another
string in $C$. If $u$, $v$ are two strings, then we denote by $u \cdot v$, or simply by $uv$,
the concatenation of $u$ with $v$.

Definition 1 Let $\Sigma$ and $\Delta$ be two alphabets. A function $c : \Sigma \times \Sigma^{\leq n} \to \Delta^+$,
with $n \geq 1$, is called an adaptive variable-length code of order $n$ if its extension
$\bar{c} : \Sigma^* \to \Delta^*$, given by

• $\bar{c}(\lambda) = \lambda$,

• $\bar{c}(\sigma_1 \cdots \sigma_m) = c(\sigma_1, \lambda)\, c(\sigma_2, \sigma_1) \cdots c(\sigma_{n+1}, \sigma_1 \cdots \sigma_n) \cdots c(\sigma_m, \sigma_{m-n} \cdots \sigma_{m-1})$,

for all strings $\sigma_1 \cdots \sigma_m \in \Sigma^+$, is injective.

As it is clearly specified in the definition above, an adaptive code of order n 
associates a variable-length codeword to the symbol being encoded depending 
on the previous n symbols in the input data string. Let us now give an example 
in order to better understand this mechanism. 

Example 2 Let $\Sigma = \{a, b, c\}$, $\Delta = \{0, 1\}$ be alphabets, and consider the
function $c : \Sigma \times \Sigma^{\leq 1} \to \Delta^+$ given by Table 1.

Table 1
An adaptive variable-length code of order one

$\Sigma \backslash \Sigma^{\leq 1}$   a     b     c     $\lambda$
a                                     010   10    0     11
b                                     011   01    100   01
c                                     11    11    101   00

One can verify that $\bar{c}$ is injective, and according to Def. 1, $c$ is an adaptive
variable-length code of order one. Let $x = abbaba \in \Sigma^+$ be an input data
string. Using Def. 1, we encode the string $x$ by

$\bar{c}(x) = c(a, \lambda)\, c(b, a)\, c(b, b)\, c(a, b)\, c(b, a)\, c(a, b) = 11011011001110$.
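The encoding in Example 2 can be reproduced mechanically. The following is a minimal sketch, assuming the code is stored as a dictionary keyed by (symbol, context); the names `TABLE_1` and `adaptive_encode` are illustrative and do not come from the paper, and the empty Python string plays the role of $\lambda$.

```python
# Table 1 as a dictionary: (symbol, context) -> codeword.
# The context "" stands for the empty string (no previous symbol).
# The entry for ("a", "c") is assumed to be "0"; it is not used below.
TABLE_1 = {
    ("a", "a"): "010", ("a", "b"): "10", ("a", "c"): "0",   ("a", ""): "11",
    ("b", "a"): "011", ("b", "b"): "01", ("b", "c"): "100", ("b", ""): "01",
    ("c", "a"): "11",  ("c", "b"): "11", ("c", "c"): "101", ("c", ""): "00",
}


def adaptive_encode(x, code, n):
    """Encode x symbol by symbol, using at most the n previous symbols as context."""
    out = []
    for i, symbol in enumerate(x):
        context = x[max(0, i - n):i]  # shorter than n at the start of the string
        out.append(code[(symbol, context)])
    return "".join(out)


encoded = adaptive_encode("abbaba", TABLE_1, n=1)
print(encoded)
```

The same function works for any order $n$, since the context is simply the (at most) $n$ symbols that precede the current position.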
Example 3 Let $\Sigma = \{a, b\}$, $\Delta = \{0, 1\}$ be alphabets, and
$c : \Sigma \times \Sigma^{\leq 2} \to \Delta^+$ a function given by Table 2. It is easy to verify
that $\bar{c}$ is injective, and according to Def. 1, $c$ is an adaptive variable-length
code of order two.

Table 2
An adaptive variable-length code of order two

$\Sigma \backslash \Sigma^{\leq 2}$   a     b     aa    ab    ba    bb    $\lambda$
a                                     010   100   0     11    110   00    001
b                                     011   010   10    01    010   10    111

Let $x = ababba \in \Sigma^+$ be an input data string. Using Def. 1, we encode the
string $x$ by

$\bar{c}(x) = c(a, \lambda)\, c(b, a)\, c(a, ab)\, c(b, ba)\, c(b, ab)\, c(a, bb) = 001011110100100$.

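Let $c : \Sigma \times \Sigma^{\leq n} \to \Delta^+$ be an adaptive variable-length code of order $n$, with
$n \geq 1$. We denote by $C_{c, \sigma_1 \sigma_2 \cdots \sigma_h}$ the set $\{c(\sigma, \sigma_1 \sigma_2 \cdots \sigma_h) \mid \sigma \in \Sigma\}$, for all
$\sigma_1 \sigma_2 \cdots \sigma_h \in \Sigma^{\leq n} - \{\lambda\}$, and by $C_{c, \lambda}$ the set $\{c(\sigma, \lambda) \mid \sigma \in \Sigma\}$. We write
$C_{\sigma_1 \sigma_2 \cdots \sigma_h}$ instead of $C_{c, \sigma_1 \sigma_2 \cdots \sigma_h}$, and $C_{\lambda}$ instead of $C_{c, \lambda}$, whenever there is no
confusion. Also, let us denote by $AC(\Sigma, \Delta, n)$ the set of all adaptive variable-length
codes of order $n$ from $\Sigma$ to $\Delta$. The proof of the following important
theorem can be found in [17].

Theorem 4 Let $\Sigma$ and $\Delta$ be two alphabets, and let $c : \Sigma \times \Sigma^{\leq n} \to \Delta^+$ be a
function, $n \geq 1$. If $C_u$ is a prefix code for all $u \in \Sigma^{\leq n}$, then $c \in AC(\Sigma, \Delta, n)$.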
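The condition of Theorem 4 is easy to test, and it also suggests how decoding works: in each context, exactly one codeword can match the next bits. The sketch below checks the condition for the order-two code of Example 3 and decodes its output. The names `TABLE_2`, `all_contexts_prefix_free` and `adaptive_decode` are illustrative assumptions, and the entry for ("a", "aa") is assumed to be "0".

```python
# Table 2 as a dictionary: (symbol, context) -> codeword ("" is the empty context).
TABLE_2 = {
    ("a", "a"): "010", ("a", "b"): "100", ("a", "aa"): "0", ("a", "ab"): "11",
    ("a", "ba"): "110", ("a", "bb"): "00", ("a", ""): "001",
    ("b", "a"): "011", ("b", "b"): "010", ("b", "aa"): "10", ("b", "ab"): "01",
    ("b", "ba"): "010", ("b", "bb"): "10", ("b", ""): "111",
}


def is_prefix_code(words):
    """True if the words are distinct and none is a proper prefix of another."""
    words = list(words)
    if len(set(words)) != len(words):
        return False
    return not any(v != w and w.startswith(v) for v in words for w in words)


def all_contexts_prefix_free(code):
    """Check the hypothesis of Theorem 4: every set C_u is a prefix code."""
    by_context = {}
    for (_symbol, context), word in code.items():
        by_context.setdefault(context, []).append(word)
    return all(is_prefix_code(ws) for ws in by_context.values())


def adaptive_decode(bits, code, n):
    """Decode a bit string produced with an adaptive code of order n (n >= 1)."""
    out, pos = "", 0
    while pos < len(bits):
        context = out[-n:]  # the (at most) n symbols decoded so far
        for (symbol, ctx), word in code.items():
            if ctx == context and bits.startswith(word, pos):
                out += symbol
                pos += len(word)
                break
        else:
            raise ValueError("no codeword matches at position %d" % pos)
    return out


print(all_contexts_prefix_free(TABLE_2))
print(adaptive_decode("001011110100100", TABLE_2, n=2))
```

Because every $C_u$ is prefix-free, the inner loop can stop at the first match without any lookahead.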



3 EAHn: an encoder based on adaptive variable-length codes 

As we have already pointed out, the aim of this section is to present a new 
lossless data compression algorithm, called EAHn, which is actually a general- 
ization, from adaptive variable-length codes of order one to adaptive variable- 
length codes of any order, of the algorithms presented in [18]. 

Let us first fix some notation, which will be used in the description
of the algorithms. Let $U = (u_1, u_2, \ldots, u_k)$ be a $k$-tuple. We denote by $U.i$ the
$i$-th component of $U$, that is, $U.i = u_i$ for all $i \in \{1, 2, \ldots, k\}$. The 0-tuple
is denoted by $()$. The length of a tuple $U$ is denoted by $Len(U)$. If $q$ is a
component or a tuple, and $i \in \{1, 2, \ldots, k\}$, then we define $U \triangleleft q$ and $U \triangleright i$
by

• $U \triangleleft q = (u_1, \ldots, u_k, q)$,

• $U \triangleright i = (u_1, \ldots, u_{i-1}, u_{i+1}, \ldots, u_k)$.

If $A$ denotes an algorithm and $x$ its input, then we denote by $A(x)$ its output.
Also, $\mathbb{N}$ denotes the set of natural numbers.

Algorithm Huffman. It is well known that Huffman's algorithm takes as
input a tuple $T = (f_1, f_2, \ldots, f_n)$ of frequencies, and returns a tuple
$V = (v_1, v_2, \ldots, v_n)$ of codewords, such that $v_i$ is the codeword corresponding to the
symbol with the frequency $f_i$, for all $i \in \{1, 2, \ldots, n\}$. This procedure will be
used by our encoder EAHn in order to construct an adaptive variable-length
code of order $n$. More precisely, if $c : \Sigma \times \Sigma^{\leq n} \to \Delta^+$ denotes the code
constructed by EAHn, then we apply Huffman's algorithm to each of the sets
$\{Freq(ua) \mid a \in \Sigma,\ Freq(ua) \geq 1\}$, where $u \in \Sigma^n$ and $Freq(ua)$ denotes the number of
occurrences of $ua$ in the input string.

Algorithm EAHn. Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_p\}$ be an alphabet, and let $x$ be
a string over $\Sigma$. Let $q$ be the number of symbols occurring in $x$ (thus, $q \leq p$).
Let us explain the main idea of our scheme. Consider that $u \in \Sigma^n$ is some
substring of the input string $x$. Also, let us denote by $Follow(u)$ the set of symbols
that follow the substring $u$ in $x$. For each symbol $c \in Follow(u)$, let us denote
by $Freq(uc)$ the frequency of the substring $uc$ in $x$. Clearly,
$Follow(u)$ cannot contain more than $q$ symbols. Moreover, in most cases, the
number of symbols in $Follow(u)$ is significantly smaller than $q$. Instead of
applying Huffman's algorithm to the frequencies of the $q$ symbols occurring
in $x$, we apply it to each of the sets $\{Freq(uc) \mid c \in Follow(u)\}$, since every
such set usually has a smaller number of elements. If $code(c, u)$ is the codeword
associated to the substring $uc$, then we encode $c$ by $code(c, u)$ whenever it is
preceded by $u$. Therefore, in most cases, we get shorter codewords.

Thus, Huffman's algorithm is actually applied to every substring u of length 
n occurring in x. We associate to each symbol a set of codewords, and encode 
each symbol with one of the codewords in its set, depending on the previous n 
symbols occurring in the input string. 

The complete algorithm is given in Fig. 1. Let us now explain what exactly the
algorithm performs at each step. The first three steps initialize
the functions needed. Note that the function $d$ allows us to access
the elements of $\Sigma^n$ in a certain order. In the fourth step, $b(x_i, x_{i-n} \cdots x_{i-1})$
is set to 1, since the substring $x_{i-n} \cdots x_{i-1} x_i$ occurs at least once in
$x$, and the frequency of $x_{i-n} \cdots x_{i-1} x_i$ is incremented. In the fifth step, for
every substring $d(j)$ occurring in $x$, we apply Huffman's algorithm to the
set $\{Freq(d(j)a) \mid a \in Follow(d(j))\}$. In the next two steps, $y$ is a tuple of
codewords constructed as follows. If $c \in \Sigma$ and $u \in \Sigma^n$, then $a(c, u)$ is appended
to $y$ if and only if $a(c, u) \neq \lambda$, that is, if $c \in Follow(u)$ and $|Follow(u)| \geq 2$.
Finally, in the last step, $Z$ denotes the compression of $x_{n+1} \cdots x_t$.

Input : a string $x = x_1 x_2 \cdots x_t \in \Sigma^+$.
Output : the tuple $(x_1 x_2 \cdots x_n, b, y, Z)$.

1. Let $a : \Sigma \times \Sigma^n \to \{0, 1\}^*$, $b : \Sigma \times \Sigma^n \to \{0, 1\}$, and $c : \Sigma \times \Sigma^n \to \mathbb{N}$.
2. Let $d : \{1, 2, \ldots, p^n\} \to \Sigma^n$ be a bijective function.
3. For each $(\sigma, u) \in \Sigma \times \Sigma^n$ do
       $a(\sigma, u) \leftarrow \lambda$; $b(\sigma, u) \leftarrow 0$; $c(\sigma, u) \leftarrow 0$
4. For $i = n + 1$ to $t$ do
       $b(x_i, x_{i-n} \cdots x_{i-1}) \leftarrow 1$
       $c(x_i, x_{i-n} \cdots x_{i-1}) \leftarrow c(x_i, x_{i-n} \cdots x_{i-1}) + 1$
5. For $j = 1$ to $p^n$ do
       $S \leftarrow ()$; $k \leftarrow 1$
       For $i = 1$ to $p$ do
           If $b(\sigma_i, d(j)) = 1$ then
               $S \leftarrow S \triangleleft c(\sigma_i, d(j))$
       If $Len(S) \geq 2$ then
           $V \leftarrow$ Huffman$(S)$
           For $i = 1$ to $p$ do
               If $b(\sigma_i, d(j)) = 1$ then
                   $a(\sigma_i, d(j)) \leftarrow V.k$; $k \leftarrow k + 1$
6. $y \leftarrow ()$; $Z \leftarrow \lambda$
7. For $j = 1$ to $p^n$ do
       For $i = 1$ to $p$ do
           If $a(\sigma_i, d(j)) \neq \lambda$ then
               $y \leftarrow y \triangleleft a(\sigma_i, d(j))$
8. For $i = n + 1$ to $t$ do
       $Z \leftarrow Z \cdot a(x_i, x_{i-n} \cdots x_{i-1})$

Fig. 1. EAHn

So, the compression of the string $x$ is actually $Z$. The first three components
of the output (i.e., $x_1 x_2 \cdots x_n$, $b$, and $y$) are only needed when decoding $Z$ into
$x$. The decoding procedure is obtained by following the same steps in reverse
order.
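The steps of Fig. 1 translate almost line by line into executable code. The following is a sketch under stated assumptions rather than the author's implementation: the Huffman routine breaks ties by position (the paper does not specify a tie-breaking rule), the bijection $d$ is taken to be the lexicographic order of contexts, and the names `huffman` and `eahn` are illustrative. The input used at the end is the string of Example 5.

```python
import heapq
from itertools import product


def huffman(freqs):
    """Return one binary codeword per frequency (expects at least two entries)."""
    # Heap items: (weight, tie-breaker, indices of the leaves in this subtree).
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    codes = [""] * len(freqs)
    tie = len(freqs)
    while len(heap) > 1:
        w0, _, leaves0 = heapq.heappop(heap)
        w1, _, leaves1 = heapq.heappop(heap)
        for i in leaves0:
            codes[i] = "0" + codes[i]
        for i in leaves1:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (w0 + w1, tie, leaves0 + leaves1))
        tie += 1
    return codes


def eahn(x, alphabet, n):
    """Offline EAHn sketch: returns (first n symbols, b, y, Z)."""
    # Steps 1-3: contexts in lexicographic order play the role of d.
    contexts = ["".join(u) for u in product(alphabet, repeat=n)]
    a = {(s, u): "" for u in contexts for s in alphabet}
    b = {(s, u): 0 for u in contexts for s in alphabet}
    c = {(s, u): 0 for u in contexts for s in alphabet}
    # Step 4: mark and count every (symbol, preceding context) pair.
    for i in range(n, len(x)):
        key = (x[i], x[i - n:i])
        b[key] = 1
        c[key] += 1
    # Step 5: one Huffman code per context with at least two followers.
    for u in contexts:
        followers = [s for s in alphabet if b[(s, u)] == 1]
        if len(followers) >= 2:
            words = huffman([c[(s, u)] for s in followers])
            for s, word in zip(followers, words):
                a[(s, u)] = word
    # Steps 6-7: collect the non-empty codewords in the order given by d.
    y = tuple(a[(s, u)] for u in contexts for s in alphabet if a[(s, u)] != "")
    # Step 8: encode x_{n+1} ... x_t.
    z = "".join(a[(x[i], x[i - n:i])] for i in range(n, len(x)))
    return x[:n], b, y, z


prefix, b, y, z = eahn("baabbabab", "ab", 2)
print(prefix, y, z)
```

A symbol that is the only follower of its context receives the empty codeword, which is why $Z$ can be shorter than the number of encoded symbols.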



Let us now take an example in order to better understand the description 
above. 

Example 5 Let $\Sigma = \{a, b\}$ be an alphabet, $x = baabbabab \in \Sigma^+$ an input
data string, and let us take $n = 2$. After applying EAH2 to the string $x$, we
get the results reported in Tables 3, 4, and 5.

Table 3
The function a after applying EAH2 to x

$\Sigma \backslash \Sigma^2$   aa          ab    ba    bb
a                              $\lambda$   0     0     $\lambda$
b                              $\lambda$   1     1     $\lambda$

Table 4
The function b after applying EAH2 to x

$\Sigma \backslash \Sigma^2$   aa    ab    ba    bb
a                              0     1     1     1
b                              1     1     1     0

Table 5
The function c after applying EAH2 to x

$\Sigma \backslash \Sigma^2$   aa    ab    ba    bb
a                              0     1     1     1
b                              1     1     2     0



Let us now explain these results by considering the third column of each table,
i.e., the column corresponding to the substring $ba$. In Table 4, $b(a, ba) = 1$
and $b(b, ba) = 1$, since the substrings $baa$ and $bab$ both occur at least once
in $x$. In Table 5, $c(a, ba) = 1$ is the frequency of $baa$ in $x$, and $c(b, ba) = 2$,
since $bab$ occurs twice in $x$. Thus, applying Huffman's algorithm to the set
of frequencies $\{1, 2\}$, we encode the symbol $a$ by $a(a, ba) = 0$ whenever it is
preceded by $ba$. Also, we encode the symbol $b$ by $a(b, ba) = 1$ whenever it is
preceded by $ba$.

Considering that the function $d$ is given by $d(1) = aa$, $d(2) = ab$, $d(3) = ba$,
and $d(4) = bb$, one can verify that the output of EAH2 in this example is the
4-tuple

$(ba, b, (0, 1, 0, 1), 01101)$,

where $b$ is the function given above. Also, one can remark that the function $b$
can be encoded using $p^{n+1}$ bits. In our example, $b$ can be encoded by $2^3 = 8$
bits, since $p = 2$ and $n = 2$.
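To see that the 4-tuple above is enough to recover $x$, here is a hedged sketch of a decoder. It rebuilds the function $a$ from $b$ and $y$, then reads $Z$ context by context. Two points are assumptions of this sketch and are not stated in the text: the contexts are enumerated in the same order as $d$, and the length $t$ of the original string is passed explicitly so that trailing symbols with empty codewords are not lost. The names `B_EXAMPLE` and `eahn_decode` are illustrative.

```python
from itertools import product

# The function b of Table 4, keyed by (symbol, context).
B_EXAMPLE = {
    ("a", "aa"): 0, ("a", "ab"): 1, ("a", "ba"): 1, ("a", "bb"): 1,
    ("b", "aa"): 1, ("b", "ab"): 1, ("b", "ba"): 1, ("b", "bb"): 0,
}


def eahn_decode(prefix, b, y, z, alphabet, t):
    """Recover x of length t from the EAHn output (prefix, b, y, Z)."""
    n = len(prefix)
    contexts = ["".join(u) for u in product(alphabet, repeat=n)]
    # Rebuild a: contexts with >= 2 followers take their codewords from y, in order;
    # a single follower gets the empty codeword.
    a = {}
    words = iter(y)
    for u in contexts:
        followers = [s for s in alphabet if b[(s, u)] == 1]
        for s in followers:
            a[(s, u)] = next(words) if len(followers) >= 2 else ""
    out, pos = prefix, 0
    while len(out) < t:
        u = out[-n:]
        for s in alphabet:
            word = a.get((s, u))
            if word is not None and z.startswith(word, pos):
                out += s
                pos += len(word)
                break
        else:
            raise ValueError("input is not a valid EAHn encoding")
    return out


print(eahn_decode("ba", B_EXAMPLE, ("0", "1", "0", "1"), "01101", "ab", 9))
```

Since each per-context code is prefix-free, at most one follower can match at any position, so the first match is the only one.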

Remark 6 Let $x = x_1 x_2 \cdots x_t$. Given that the size $p$ of the alphabet and the
parameter $n$ do not depend on the input, we can conclude that the runtime of
EAHn is linear in the size of the input. More precisely, for a fixed value of $n$,
the runtime of EAHn is $O(t)$. If we consider that the input consists of both $n$
and $t$, then the runtime is $O(\max\{p^n, t\})$. Thus, given that $p$ is a constant,
we conclude that the runtime is either exponential in $n$ or linear in
the size of the input. However, in practice $p$ and $n$ are very small numbers, so
it is unlikely that we need more than $O(t)$ time.

Even though EAHn is based on adaptive variable-length codes, i.e., it encodes
the current symbol adaptively by taking into account only the last few symbols,
it is an offline algorithm, since the codewords are chosen only after the
input string has been completely processed. Let us explain this in detail. Let $c$
be the current symbol and $u \in \Sigma^n$ the string of length $n$ preceding $c$. Then,
the codeword associated to $c$ is chosen based on the frequency of $uc$ in the
entire input. EAHn can be easily transformed into an online (i.e., dynamic)
algorithm as follows: we associate a codeword to the current symbol $c$ based
only on the frequency of $uc$ in the text already processed. More precisely, if
$x = x_1 x_2 \cdots x_t$ is the input string and $x_i$ is the current symbol, then the
codeword associated to $x_i$ is chosen based on the frequency of $x_{i-n} \cdots x_{i-1} x_i$ in
$x_1 x_2 \cdots x_i$.

It is well-known [19] that, except for short messages, the online Huffman coding 
always produces a longer encoding than the offline version, the reason being 
that the offline version is optimal in this respect. In practice, the online version 
of EAHn is a good choice whenever the runtime is a critical parameter in the 
system. Moreover, as described in [19], an online compressor can be used to 
encode the message in real time. Thus, whenever the runtime is one of the most 
critical parameters in the system, the online version of EAHn is preferred over 
the static one, since the message can be dynamically encoded. 



4 Entropy bounds for EAHn 

This section focuses on computing the entropy bounds for EAHn. Since the 
EAHn encoder is based on Huffman's algorithm, let us first recall the well- 
known bounds for Huffman's algorithm. 

Definition 7 Let $\Sigma$ be an alphabet, $x$ an input data string of length $t$ over $\Sigma$,
and $k$ the length of the encoder output. The compression rate of $x$, denoted by
$R(x)$, is defined by

$$R(x) = \frac{k}{t}. \tag{1}$$

Let $R_H(x)$ be the compression rate in codebits per datasample, computed
after encoding the string $x$ by Huffman's algorithm. One can obtain upper
and lower bounds on $R_H(x)$ before encoding the data string $x$, by computing
the entropy denoted by $E_H(x)$. Let us consider that $x$ is a string of length $t$,
$(F_1, F_2, \ldots, F_h)$ is the $h$-tuple of frequencies corresponding to the symbols in
$x$, and $k$ is the length of the Huffman encoder output. The entropy $E_H(x)$ of
$x$ is defined by

$$E_H(x) = \frac{1}{t} \sum_{i=1}^{h} F_i \log_2 \frac{t}{F_i}. \tag{2}$$

Let $L_i$ be the length of the codeword associated by Huffman's algorithm to the
symbol with the frequency $F_i$, for all $i \in \{1, 2, \ldots, h\}$. Then, the compression
rate $R_H(x)$ can be rewritten as

$$R_H(x) = \frac{1}{t} \sum_{i=1}^{h} F_i L_i. \tag{3}$$
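The entropy and the Huffman compression rate can be computed directly from the symbol frequencies. The sketch below assumes the usual empirical entropy $\frac{1}{t} \sum_i F_i \log_2 (t / F_i)$ and a textbook Huffman construction; the function names and the sample string are illustrative. It also checks the well-known Huffman bounds, namely that the rate lies between the entropy and the entropy plus one.

```python
import heapq
from collections import Counter
from math import log2


def huffman_lengths(freqs):
    """Codeword length L_i for each frequency F_i (expects at least two symbols)."""
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    tie = len(freqs)
    while len(heap) > 1:
        w0, _, g0 = heapq.heappop(heap)
        w1, _, g1 = heapq.heappop(heap)
        for i in g0 + g1:
            lengths[i] += 1  # every leaf under the merged node moves one level down
        heapq.heappush(heap, (w0 + w1, tie, g0 + g1))
        tie += 1
    return lengths


def entropy_and_rate(x):
    """Return (E_H(x), R_H(x)) in bits per symbol."""
    t = len(x)
    freqs = list(Counter(x).values())
    entropy = sum(f * log2(t / f) for f in freqs) / t
    rate = sum(f * l for f, l in zip(freqs, huffman_lengths(freqs))) / t
    return entropy, rate


entropy, rate = entropy_and_rate("abracadabra")
print(round(entropy, 4), round(rate, 4))
```

For this sample string the Huffman code spends 23 bits on 11 symbols, which is slightly above the entropy, as expected.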



Relating the entropy $E_H(x)$ to the compression rate $R_H(x)$, we obtain the
inequalities

$$E_H(x) \leq R_H(x) \leq E_H(x) + 1. \tag{4}$$

Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_p\}$ be the same alphabet of size $p$, and let $x = x_1 x_2 \cdots x_t$
be a string over $\Sigma$. Let us denote by $R_{EAHn}(x)$ the compression rate obtained
after encoding $x$ by EAHn. More precisely, $R_{EAHn}(x)$ is given by

$$R_{EAHn}(x) = \frac{|Z|}{t}, \tag{5}$$

where $Z$ denotes the fourth component of the EAHn output, i.e., the encoding
of $x_{n+1} x_{n+2} \cdots x_t$. Also, let us denote by $E_{EAHn}(x)$ the EAHn entropy of $x$.

We can obtain upper and lower bounds on $R_{EAHn}(x)$ before encoding $x$, by
computing the entropy $E_{EAHn}(x)$. Given that EAHn uses Huffman's algorithm
to associate a set of codewords to each substring of length $n$ occurring in
$x_1 x_2 \cdots x_{t-1}$, it is clear that the entropy $E_{EAHn}(x)$ can be computed as a sum
of entropies. Let us explain this in detail, since it will help us to derive the
final formula for $E_{EAHn}$.

Let $J : \{0, 1, \ldots, p^n - 1\} \to \Sigma^n$ be a fixed bijective function. In other words,
$J(i)$ identifies a certain substring of length $n$. Thus, each substring $J(i)$ of
length $n$ can be uniquely identified by its index $i$. Let $A$ be the set of those
indexes $i$ such that $J(i)$ occurs at least once in $x_1 x_2 \cdots x_{t-1}$. It is now clear
that

$$E_{EAHn}(x) = \sum_{i \in A} E_{EAHn}(x, i), \tag{6}$$

where $E_{EAHn}(x, i)$ denotes the entropy corresponding to those positions in $x$
preceded by $J(i)$. Now, in order to establish the formula for $E_{EAHn}(x)$, we
only need to compute $E_{EAHn}(x, i)$. Let $C(i)$ denote the set of those $j$'s such
that $J(i) \cdot \sigma_j$ occurs at least once in $x$. In other words, $C(i)$ denotes the set of
those symbols in $x$ preceded by $J(i)$. The cardinality of $C(i)$ corresponds to
the number $h$ in the general case described above. Also, let $N(i)$ be the total
number of positions in $x$ preceded by $J(i)$. $N(i)$ corresponds to the number $t$
in the general case. Finally, if we denote by $F(i, j)$ the frequency of $J(i) \cdot \sigma_j$
in $x$, we can conclude that

$$E_{EAHn}(x, i) = \frac{\sum_{j \in C(i)} \left[ F(i, j) \log_2 \frac{N(i)}{F(i, j)} \right]}{t}. \tag{7}$$



$F(i, j)$ corresponds to the number $F_i$ in the general case. Thus, the entropy
$E_{EAHn}(x)$ is actually a sum of entropies, where each such local entropy is
computed similarly to the general case.
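The sum of local entropies can be computed in a single pass over the input. The sketch below accumulates, for every context $J(i)$, the quantity $\sum_{j \in C(i)} F(i, j) \log_2 (N(i) / F(i, j))$, and divides the total by $t$. Normalising by $t$ is an assumption made here so that the result is directly comparable with the rate $|Z| / t$; the function name `eahn_entropy` is illustrative.

```python
from collections import Counter, defaultdict
from math import log2


def eahn_entropy(x, n):
    """Sum of per-context entropies of x, in bits per input symbol."""
    t = len(x)
    follow = defaultdict(Counter)  # context J(i) -> {symbol: F(i, j)}
    for k in range(n, t):
        follow[x[k - n:k]][x[k]] += 1
    total_bits = 0.0
    for counts in follow.values():
        n_i = sum(counts.values())  # N(i): positions preceded by this context
        total_bits += sum(f * log2(n_i / f) for f in counts.values())
    return total_bits / t


value = eahn_entropy("baabbabab", 2)
print(round(value, 4))
```

For the string of Example 5, the contexts $aa$ and $bb$ contribute nothing (each has a single follower), and the result stays below the observed rate $|Z| / t = 5/9$.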



5 Implementation details for EAH1 



In this section, we give complete details regarding the structure of the files
compressed with EAH1, that is, we describe precisely how the output of EAH1 is
encoded in a file. The structure is shown in Table 6.

Table 6
The structure of the files compressed with EAH1

Field         Z1   Z2   Z3   Z4   Z5    Z6    Z7   Z8   Z9
Size (bits)   8    NF   8    8    256   256   NF   NF   NF

In our implementation, the size of the alphabet is 256, so each symbol requires
8 bits. In the table above, each of the fields Z1, Z2, ..., Z9 is shown with
its size in bits. NF (not fixed) denotes the size of the fields whose length
depends on the input. Let us now describe each field separately.

Z1. This field has a fixed length (8 bits), and specifies how many padding
bits are used in the field Z2. Thus, the number encoded by Z1 gives us the
length of Z2.

Z2. This second field consists of a sequence of padding bits, which are
appended to the whole structure so that the total number of bits in Z1, ..., Z9
is a multiple of 8. So, the length of Z2 is a number between 0 and 7.

Z3. If the input data string for EAH1 is $x = x_1 x_2 \cdots x_t$, then this field encodes
the symbol $x_1$, that is, the first component of the output.

Z4. The fourth field gives us the maximum length of a codeword in $y$, where
$y$ is the third component of the output. Let us denote this number by MAXLC.

Z5. For each symbol in the alphabet, we need one bit to specify whether it
occurs at least once in $x_1 x_2 \cdots x_{t-1}$. Let us denote by NC the number of
bits equal to 1 in this field.

Z6. For each symbol in the alphabet, we need one bit to specify whether it
occurs at least once in $x_2 x_3 \cdots x_t$. Let us denote by NL the number of bits
equal to 1 in this field.

Z7. This field consists of NL * NC bits, and together with Z5 and Z6 encodes
the function $b$, that is, the second component of the output.

Z8. The length of this field is $Len(y) * (MAXLC + 1)$, since each of the
codewords in $y$ is encoded by MAXLC + 1 bits. Actually, $Len(y)$ is at most
the number of bits equal to 1 in Z7. Let us now explain how exactly a codeword in $y$
is encoded using only MAXLC + 1 bits. If $cw = B_1 B_2 \cdots B_i$ is a codeword in
$y$, then we know that $i \leq MAXLC$. Let us denote by $\bar{B}$ the complement of
the bit $B$. If $i = MAXLC$, then we encode $cw$ by $B_1 B_1 B_2 \cdots B_i$. Otherwise,
if $i < MAXLC$, we encode $cw$ by $B_1 u B_1 B_2 \cdots B_i$, where $u = \bar{B}_1 \cdots \bar{B}_1$ is a
sequence of length $MAXLC - i$. The decoding works as follows. Suppose that
$C_1 C_2 \cdots C_{MAXLC+1}$ is a sequence of bits denoting the encoding of a codeword
$cw$, and let $j$ be such that $C_2 = \ldots = C_j$, but $C_j \neq C_{j+1}$ (if there does not
exist such an index, we know that the codeword $cw$ is $C_2 \cdots C_{MAXLC+1}$). If
$C_1 = C_2$, then $cw$ is $C_2 \cdots C_{MAXLC+1}$. Otherwise, if $C_1 \neq C_2$, we know that
$cw = C_{j+1} \cdots C_{MAXLC+1}$.
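The fixed-width encoding of a codeword in Z8 can be sketched in a few lines. This follows the scheme as it can be inferred from the decoding rule above: the first bit repeats $B_1$, a short codeword is preceded by a run of complemented bits, and a full-length codeword has no such run. The function names `pad_codeword` and `unpad_codeword` are illustrative assumptions.

```python
def pad_codeword(cw, maxlc):
    """Encode a codeword of length <= maxlc on exactly maxlc + 1 bits."""
    first = cw[0]
    complement = "1" if first == "0" else "0"
    if len(cw) == maxlc:
        return first + cw  # B1 B1 B2 ... Bi
    # B1, then maxlc - i complemented bits, then the codeword itself.
    return first + complement * (maxlc - len(cw)) + cw


def unpad_codeword(field):
    """Recover the codeword from its (maxlc + 1)-bit field."""
    if field[0] == field[1]:
        return field[1:]  # full-length codeword
    j = 1
    while field[j] == field[1]:  # skip the run of complemented bits
        j += 1
    return field[j:]


for word in ["0", "10", "11", "101"]:
    field = pad_codeword(word, 3)
    print(word, field, unpad_codeword(field))
```

The run of complemented bits always ends, because it is followed by $B_1$, which differs from every bit in the run.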

Z9. Finally, the last field denotes the compression of $x_2 x_3 \cdots x_t$. Precisely,
this field is $Z$, the last component of the output.



6 Experimental results 

Given that EAHn is an offline algorithm, it seems natural to compare it with
Huffman's classical algorithm. For the experiments, we have chosen two of the
best-known corpora: the Calgary Compression Corpus (CCC) and the Large
Canterbury Corpus (LCC) [20]. The comparisons provided in Tables
7, 8, and 9 show the differences between EAH1 and Huffman's classical
encoder. As one can see, the EAH1 encoder gives significantly better results
on both corpora. The improvement is approximately 21.20% for CCC, and
18.99% for LCC.

Table 7
Results of compressing three files of the Large Canterbury Corpus (LCC)

File           Size (bytes)   HUFFMAN     EAH1        Improvement (%)
E.coli         4,638,690      1,159,677   1,159,748
bible.txt      4,047,392      2,218,595   1,690,454   23.80
world192.txt   2,473,400      1,558,845   1,148,918   26.29
Total          11,159,482     4,937,117   3,999,120

Table 8
Results of compressing fourteen ASCII files of the Calgary Compression Corpus (CCC)

File    Size (bytes)   HUFFMAN     EAH1        Improvement (%)
[per-file rows are not legible in this copy]
Total   2,367,559      1,440,264   1,134,835

Table 9
Improvements per corpus

Corpus   Total compressed size, HUFFMAN   Total compressed size, EAH1   Improvement (%)
CCC      1,440,264                        1,134,835                     21.20
LCC      4,937,117                        3,999,120                     18.99
Total    6,310,138                        5,132,328





7 An EREW PRAM version of EAHn 

This section focuses on describing a parallel version of EAHn, using the
well-known PRAM model of computation [7]. As we shall see, EAHn is highly
parallelizable, mostly because during its execution, Huffman's algorithm is
applied to disjoint sets of frequencies.

Let us first recall some basic concepts. The shared-memory model consists of a 
number of processors that have access to a single shared memory unit, usually 
referred to as global memory. Each processor has its own local memory and can 
execute its own local program. Processors communicate by exchanging data 
through the shared memory unit. Each processor is identified by its unique 
index, which is available in its local memory. A general view of this model is 
given in Fig. 2. 



[Figure: processors 1, 2, ..., t, each connected to the Shared Memory]

Fig. 2. The Shared-memory Model



If all the processors work synchronously under the control of a common clock,
then the shared-memory model is usually called the parallel random-access
machine (PRAM) model. Throughout this section, we use only the
exclusive-read exclusive-write (EREW) PRAM submodel, i.e., the PRAM submodel that
does not allow any simultaneous access to a single memory location. For more
details, the reader is referred to [7].

Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_p\}$ be the same alphabet of size $p$, and $x = x_1 x_2 \cdots x_t$ an
input string over $\Sigma$. Suppose that we have $q = p^n$ EREW PRAM processors,
denoted $p_1, p_2, \ldots, p_q$. Then, we can assign one processor to each substring
$u \in \Sigma^n$, since the cardinality of $\Sigma^n$ is $p^n$. Let $d : \{1, 2, \ldots, p^n\} \to \Sigma^n$ be a fixed
bijective function. For each $i \in \{1, 2, \ldots, p^n\}$, processor $p_i$ is assigned to the
string $d(i)$.

For each $u \in \Sigma^n$ that occurs at least once in $x_1 x_2 \cdots x_{t-1}$, EAHn applies
Huffman's algorithm to the set of frequencies $\{Freq(uc) \mid c \in Follow(u)\}$, as
described in section 3. Therefore, in the parallel version, processor $p_i$ applies
Huffman's algorithm to the set $\{Freq(d(i)c) \mid c \in Follow(d(i))\}$ if $d(i)$
occurs at least once in $x_1 x_2 \cdots x_{t-1}$.

The complete algorithm is given in Fig. 3. Note that the functions a, b, c, and 
d must reside in the global memory, such that each processor can access its 
corresponding memory locations. Given that the processors do not read from 
(or write into) the same memory locations, we conclude that our algorithm 
runs under the EREW PRAM model. 

The first two steps allocate the necessary space in the shared
memory for the functions $a$, $b$, $c$, and $d$. The third step is a parallel one. More
precisely, each processor $p_i$ initializes $a(\sigma, d(i))$, $b(\sigma, d(i))$, and $c(\sigma, d(i))$
for every $\sigma \in \Sigma$. As one can remark, the fourth step is not executed in parallel.
Therefore, we can consider that it is executed by processor $p_1$.

Input : a string $x = x_1 x_2 \cdots x_t \in \Sigma^+$.
Output : the tuple $(x_1 x_2 \cdots x_n, b, y, Z)$.

1. Let $a : \Sigma \times \Sigma^n \to \{0, 1\}^*$, $b : \Sigma \times \Sigma^n \to \{0, 1\}$, and $c : \Sigma \times \Sigma^n \to \mathbb{N}$
   be three functions with the necessary space allocated in the global
   memory.
2. Let $d : \{1, 2, \ldots, p^n\} \to \Sigma^n$ be a bijective function with the necessary
   space allocated in the global memory.
3. For each $u \in \Sigma^n$ pardo
       For each $\sigma \in \Sigma$ do
           $a(\sigma, u) \leftarrow \lambda$; $b(\sigma, u) \leftarrow 0$; $c(\sigma, u) \leftarrow 0$
4. For $i = n + 1$ to $t$ do
       $b(x_i, x_{i-n} \cdots x_{i-1}) \leftarrow 1$
       $c(x_i, x_{i-n} \cdots x_{i-1}) \leftarrow c(x_i, x_{i-n} \cdots x_{i-1}) + 1$
5. For $j = 1$ to $p^n$ pardo
       $S \leftarrow ()$; $k \leftarrow 1$
       For $i = 1$ to $p$ do
           If $b(\sigma_i, d(j)) = 1$ then
               $S \leftarrow S \triangleleft c(\sigma_i, d(j))$
       If $Len(S) \geq 2$ then
           $V \leftarrow$ Huffman$(S)$
           For $i = 1$ to $p$ do
               If $b(\sigma_i, d(j)) = 1$ then
                   $a(\sigma_i, d(j)) \leftarrow V.k$; $k \leftarrow k + 1$
6. $y \leftarrow ()$; $Z \leftarrow \lambda$
7. For $j = 1$ to $p^n$ pardo
       $y_j \leftarrow ()$
       For $i = 1$ to $p$ do
           If $a(\sigma_i, d(j)) \neq \lambda$ then
               $y_j \leftarrow y_j \triangleleft a(\sigma_i, d(j))$
8. For $j = 1$ to $p^n$ do
       Append the components of $y_j$ to $y$
9. For $i = n + 1$ to $t$ do
       $Z \leftarrow Z \cdot a(x_i, x_{i-n} \cdots x_{i-1})$

Fig. 3. An EREW PRAM version of EAHn

In the fifth step, each processor $p_i$ applies Huffman's algorithm to the set
$\{Freq(d(i)c) \mid c \in Follow(d(i))\}$ if and only if this set has at least two elements.
The last parallel step is step 7, which constructs the tuples $y_j$ from which the
output tuple $y$ is assembled. Steps 6, 8, and 9 are executed by only one of the
processors, i.e., they are not parallel steps. For example, step 8 is not executed
in parallel since the components of $y$ must be in a certain order.

Runtime. It is clear that the runtime is $O(\max\{p^n, t\})$, where $p$ is the
constant size of the alphabet. However, given that most of the steps are executed
in parallel, the constant hidden under the O-notation is significantly smaller
than in the sequential version of EAHn. For example, in the fifth step,
Huffman's algorithm is applied in parallel to different sets of frequencies. Given
that this is one of the most time-consuming steps, running it in parallel
significantly reduces the runtime.
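The independence of the per-context Huffman runs can be illustrated with a thread pool. This is only an analogy, assuming one task per context: it is not a PRAM implementation, and in CPython it shows the structure of the computation and not a real speed-up. The names `codes_for_context`, `parallel_codes` and `FREQ_SETS` are illustrative; the frequency sets are those of the contexts $aa$, $ab$, $ba$, $bb$ in Example 5.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def huffman(freqs):
    """Return one binary codeword per frequency (expects at least two entries)."""
    heap = [(f, i, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    codes = [""] * len(freqs)
    tie = len(freqs)
    while len(heap) > 1:
        w0, _, g0 = heapq.heappop(heap)
        w1, _, g1 = heapq.heappop(heap)
        for i in g0:
            codes[i] = "0" + codes[i]
        for i in g1:
            codes[i] = "1" + codes[i]
        heapq.heappush(heap, (w0 + w1, tie, g0 + g1))
        tie += 1
    return codes


def codes_for_context(freqs):
    """Work done for one context: Huffman only if it has at least two followers."""
    return huffman(freqs) if len(freqs) >= 2 else [""] * len(freqs)


def parallel_codes(freq_sets, workers=4):
    """Process every context independently; results keep the order of the contexts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(codes_for_context, freq_sets))


FREQ_SETS = [[1], [1, 1], [1, 2], [1]]  # contexts aa, ab, ba, bb
print(parallel_codes(FREQ_SETS))
```

No task reads or writes data that belongs to another context, which mirrors the exclusive-read exclusive-write discipline of the model.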



8 Conclusions and future work 

Adaptive variable-length codes have been recently presented in [17,18] as a new
class of non-standard variable-length codes. New algorithms for data compression,
based on adaptive codes of order one and Huffman's algorithm, have
also been presented in [18]. In this paper, we extended this work with the
following contributions: first, we proposed an improved generalization of the
algorithms presented in [18], called EAHn. Second, we computed the entropy
bounds for EAHn, using the well-known bounds for Huffman's algorithm.
Third, we discussed implementation details and reported experimental
results obtained on some well-known corpora. Finally, we described a parallel
version of EAHn using the PRAM model of computation.

One of the most ambitious future plans is to investigate the possibility of incor- 
porating the algorithms presented here into the projects currently developed 
by some industrial laboratories [5]. 